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Abstract 

This  paper  gives  an  overview  of  our  work  for  related  entity  finding  which  is  proposed  in  TREC 
2010  Entity  Track.  The  goal  of  the  Entity  Track  is  to  find  the  entities  relevant  to  a  given  query  from  the 
web  corpus.  In  this  paper,  we  propose  a  bipartite  graph  reinforcement  model  for  entity  ranking.  As  is 
well  known,  the  entities  on  the  web  are  embedded  not  only  in  the  natural  language  text,  but  also  in  the 
tables  and  lists.  Given  a  query,  both  the  candidate  entities  and  relevant  tables/lists  are  extracted  from 
web  documents.  Then  the  candidate  entities  extracted  from  unstructured  text  are  ranked  based  on  a 
probabilistic  model.  But  the  result  contains  a  lot  of  noise.  If  some  candidate  entities  are  in  a  relevant 
table/list,  they  are  more  relevant  to  the  given  query.  And  Vice  versa,  if  a  table/list  contains  several 
candidate  entities,  it  is  also  more  relevant  to  the  query.  Based  on  the  above  intuition,  we  construct  a 
bipartite  graph  and  then  perform  a  reinforcement  algorithm  to  re-rank  the  candidate  entities. 

1.  Introduction 

The  World  Wide  Web  has  been  growing  rapidly  as  huge  knowledge  repository  which  contains  so 
rich  information.  Traditional  information  retrieval  systems  return  a  list  of  documents  relevant  to  user’s 
queries.  However,  users  often  show  more  interest  in  the  entities  instead  of  documents  given  their 
queries.  So  it  is  necessary  to  study  how  to  automatically  find  the  related  entities  given  a  query. 

The  main  task  of  Entity  Track  focuses  on  the  related  entity  finding  (REF):  given  an  input  entity, 
by  its  name  and  homepage,  the  type  of  the  target  entity,  as  well  as  the  nature  of  their  relation,  described 
in  free  text,  find  related  entities  that  are  of  target  type,  standing  in  the  required  relation  to  the  input 
entity [11 . 

2.  System  Architecture 

Our  approach  can  be  summarized  by  the  following  steps: 

1 .  Form  the  query  based  on  the  source  entity  and  narrative 

2.  Extract  the  relevant  text  snippets  from  the  documents  and  recognize  the  named  entities  with  the 
given  target  type 

3 .  Compute  the  initial  ranking  score  for  each  candidate  entity  based  on  a  probabilistic  model 

4 .  Extract  the  relevant  tables/lists  which  may  contain  several  target  entities  from  documents 

5 .  Construct  the  bipartite  graph  based  on  the  candidate  entities  and  relevant  tables/lists.  And  then  the 
reinforcement  learning  algorithm  is  performed  over  the  graph  to  get  the  final  score  for  each 
candidate  entity. 

6.  Find  the  homepage  for  candidate  entities 

The  following  sections  will  explain  some  key  steps  of  our  approach. 

2.1  Query  Formation 

The  query  is  formed  as  a  keyword  set  based  on  the  given  entity  and  narrative.  Words  indicating  the 
relationship  should  be  extracted  from  the  narrative.  With  a  part-of-speech  tagger,  we  parse  the  narrative 
to  get  the  verb  or  noun  to  form  the  keyword  set.  After  the  initial  keyword  set  is  formed,  we  argument  it 
with  synonyms  of  the  keywords  from  WordNet  [2]. 

2.2  Named  Entities  Identification 
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After  the  relevant  documents  are  retrieved,  we  extracted  the  relevant  text  snippets  from  the  documents. 
Then  the  Stanford  NER  Tagger1  is  used  to  extract  the  named  entities  with  the  given  target  type  from  the 
text  snippets.  At  the  same  time,  we  detect  the  boundary  of  named  entities  with  the  aid  of  anchor  text. 
After  acquiring  the  initial  list  of  candidate  entities,  we  apply  several  heuristic  rules  to  remove  the 
duplicated  entities.  Finally,  we  get  a  list  of  candidate  entities  with  their  support  snippets. 

2.3  Initial  Candidate  Entities  Ranking 

After  acquiring  the  list  of  candidate  entities,  an  initial  ranking  value  is  calculated  for  each  of  them.  We 
define  Q  =  {es,  r,  T},  where  es  denotes  the  source  entity,  r  denotes  the  relationship  between  the  source 
entity  and  the  target  entity,  and  T  denotes  the  type  of  target  entity.  And  let  e*  denotes  the  target  entity. 
The  candidate  entities  are  ranked  by  the  probability  p(e'|Q).  By  applying  Bayesian  Theorem  and  Chain 
rules,  p(e‘|Q)  could  be  decomposed  into  the  following  form: 

P(e‘  I  G)acP(et,Q) 


=  p(e',es,r,T ) 

=  P(r\e  ,e  ,T)p(es ,e  ,T) 

=  pif  I  es,e‘,T)p(T  \  es ,e')p(es ,e‘)  (!) 

=  p(r  |  es,e',T)p(T  \  es,e')p(e‘  \  es)p(es ) 

~  p(r  |  es ,e')p(T  \  e')p(e'  \  es ) 

In  equation  (1),  there  are  three  components:  p(r|es,e‘)  the  probability  that  r  is  mentioned  between  es  and 
e‘;  p(T|e*)  the  probability  that  e*  mentions  target  type  T;  p(et|ei)  the  probability  that  es  mentions  e1.  This 
step  is  very  similar  to  that  in  the  work  [3].  The  only  difference  is  the  estimation  for  the  components. 

2.4  Relevant  Candidate  Tables/Lists  Extraction 

The  information  on  the  web  can  be  organized  or  represented  as  several  forms  such  as  natural  language 
text,  table,  list  and  so  on.  In  the  task  related  entities  finding,  many  target  entities  are  found  not  only  in 
the  natural  language  text  snippet,  but  also  in  the  tables/lists.  So  the  extraction  of  relevant  tables/lists  is 
very  necessary  for  this  task.  We  define  the  relevance  of  the  table/list  as  follows: 

•  The  context  of  the  table/list  is  relevant  to  the  query 

•  The  table/list  should  contain  several  candidate  entities  extract  from  the  unstructured  text. 

Here,  we  simply  apply  some  heuristic  methods  such  as  rules  to  extract  the  relevant  tables/lists.  The 
features  used  for  extraction  of  tables/lists  are  very  similar  to  those  used  in  the  work.  [4]. 

2.5  Candidate  Entities  Re-Ranking 
2.5.1  Bipartite  Graph  Construction 

We  construct  a  bipartite  graph  based  on  the  candidate  entities  and  relevant  tables/lists.  The  candidate 
entities  and  relevant  tables/lists  are  regarded  as  the  two  disjoint  sets  of  graph  vertices.  If  one  candidate 
entity  is  in  a  relevant  table/list,  the  vertices  corresponding  to  them  will  be  connected  by  an  edge.  For 
two  vertices  ej,  lj  of  one  edge,  the  weight  of  the  edge  is  defined  as  follows: 


1  if  et  is  connected  to  l  j 
[0  otherwise 


2.5.2  Reinforcement  Learning 

Once  the  bipartite  graph  is  constructed,  we  apply  the  reinforcement  Learning  algorithm  over  it.  Similar 
to  [5],  the  Iteration  equation  is  as  follows: 


(2) 


IX+I  =  aMlLn  +  (l-a)E0 
\Ln+l  =  j3M2En+1  +  (l-j3)L0 

Mi  =  AD'1 
m2  =  ATDEl 

E0  and  L0  are  the  initial  Ranking  value  vectors  of  the  candidate  entities  and  relevant  table/list, 
respectively.  En  and  Ln  are  the  ranking  value  vectors  after  n  iterations.  A  is  the  biadjacency  matrix  of 
bipartite  graph.  AT  is  the  transpose  of  A.  DL  is  the  diagonal  matrix  with  its  (i;i)-element  equal  to  the 
sum  of  the  i-th  column  of  A;  DE  is  the  diagonal  matrix  with  its  (i;i)-element  equal  to  the  sum  of  the  i-th 
row  of  A.aand  P  are  the  weight.  After  several  iterations,  we  get  the  final  score  for  each  candidate  entity. 
2.6  Find  homepages  for  entities 

For  the  homepage  finding,  we  use  search  engine  such  as  Google  to  search  for  top-K  urls  and  some 
heuristic  rules  are  used  to  identify  the  real  homepage.  Then  we  search  for  the  corresponding  document 
from  the  corpus  with  the  url. 

3.  Experiments 
3.1  Our  Experiment 

In  the  experiment,  our  focus  was  to  retrieve  the  related  entities,  and  not  on  finding  homepages 
corresponding  to  the  related  entities.  We  computed  the  retrieval  measure  over  the  name  of  entities. 

We  evaluated  the  proposed  model  on  a  standard  collection  from  TREC  2010  Entity  Track.  The 
collection  includes:  (1)  ClueWeb09  Category  A  (about  500  million  pages)  (2)  twenty  queries  of  which 
topic  Ids  are  from  21  to  40  (3)  Relevant  entities  manually  found  from  web  for  each  query  using 
Google. 

For  each  query,  the  top  100  ranked  entities  are  returned.  We  adopt  the  P@10  and  nDCG@R  as  the 
measure  to  evaluate  the  performance.  The  ranking  strategy  based  on  the  probabilistic  model  is  chosen 
as  the  baseline.  The  result  is  as  follows: 


mean  P@10 

mean  nDCG@R 

Probabilistic  Model 

(Baseline) 

0.2325 

0.4530 

Bipartite  Graph 

Reinforcement  Model 

0.2975 

0.5150 

Table  1 :  The  results  of  our  experiment  for  entity  ranking 


From  the  results,  we  can  see  that  both  the  average  P@10  and  average  nDCG@R  are  improved.  So  the 
bipartite  graph  reinforcement  model  is  more  effective  for  the  related  entity  finding  than  the 
probabilistic  model. 

3.1  Official  Results 


P@10 

nDCGR 

MAP 

Rprec 

0.1277 

0.1611 

0.0839 

0.1305 

Table  2  :  The  official  results  for  our  run 


The  official  results  are  listed  in  Table  2.  Our  scores  are  low.  One  possible  reason  is  that  we  fail  to  find 
the  true  homepages  though  we  could  find  related  entities. 

4.  Conclusion  and  Future  Work 

In  this  paper,  we  represent  a  bipartite  graph  reinforcement  model  for  the  relate  entity  ranking.  As  is 
well  known,  the  entities  on  the  web  are  embedded  not  only  in  the  natural  language  text,  but  also  in  the 


table  or  list.  Given  a  query,  both  the  candidate  entities  and  relevant  tables/lists  are  extracted  from  web 
documents.  The  ranking  of  candidate  entities  can  benefit  from  the  relevant  tables/lists  which  may 
contain  several  target  entities.  And  Vice  versa  the  ranking  of  relevant  tables/lists  may  be  improved  if  it 
contains  several  candidate  entities.  The  mutual  reinforcement  between  the  candidate  entities  and 
relevant  tables/lists  will  improve  the  effectiveness  of  the  entity  ranking 

However,  the  performance  can  be  further  improved  with  better  techniques.  In  future  work,  we  will 
investigate  other  methods  for  NER  and  homepage  finding.  And  we  will  also  extract  the  target  entities 
from  the  tables/lists  with  high  score  to  improve  the  recall. 
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